1 Statement of Authorship

This report is created by Ekene Olatunji, Catherine Nassralla, Victoria Chin, Bassma Chamas, and Sajiya Somji for the eHealth 705 Statistics in eHealth final project.

2 Executive Summary

Using the principles taught during the eHealth 705 course, the ICU Admissions dataset was explored, analyzed and a logistic regression model was created. Logistic regression was used to create a model as the response variable, Status, was categorical, as were several of the predictor variables in this dataset. Using graphs, descriptive statistics and summaries, the attributes were analyzed and recoded if needed, as well as interpreted. Several of the attributes that were focused on related to patient characteristics at admission, as it might be valuable to understand features of an individual’s medical situation that might make them more likely to survive or die during an ICU admission.

(include reasons why we did tests, summary of what was found. ) - overview of tests that we performed and some of the results - maybe include why we chose some of the tests - since the main predictor variable was vital status, most of the analyses and exploration was about this variable

3 About the ICU Admissions Dataset

3.1 Description

The ICU Admission dataset contains observations of patients that were admitted to an adult intensive care unit (ICU). The response variable of this dataset is Status of patients after admission, that is, whether they lived or died after admission to ICU. A description of the variables is available below.

Number of cases: 200
Variable Names:

  1. ID: ID number of the patient
  2. STA: Vital status (0 = Lived, 1 = Died)
  3. AGE: Patient’s age in years
  4. SEX: Patient’s sex (0 = Male, 1 = Female)
  5. RACE: Patient’s race (1 = White, 2 = Black, 3 = Other)
  6. SER: Service at ICU admission (0 = Medical, 1 = Surgical)
  7. CAN: Is cancer part of the present problem? (0 = No, 1 = Yes)
  8. CRN: History of chronic renal failure (0 = No, 1 = Yes)
  9. INF: Infection probable at ICU admission (0 = No, 1 = Yes)
  10. CPR: CPR prior to ICU admission (0 = No, 1 = Yes)
  11. SYS: Systolic blood pressure at ICU admission (in mm Hg)
  12. HRA: Heart rate at ICU admission (beats/min)
  13. PRE: Previous admission to an ICU within 6 months (0 = No, 1 = Yes)
  14. TYP: Type of admission (0 = Elective, 1 = Emergency)
  15. FRA: Long bone, multiple, neck, single area, or hip fracture (0 = No, 1 = Yes)
  16. PO2: PO2 from initial blood gases (0 = >60, 1 = ²60)
  17. PH: PH from initial blood gases (0 = ³7.25, 1 <7.25)
  18. PCO: PCO2 from initial blood gases (0 = ²45, 1 = >45)
  19. BIC: Bicarbonate from initial blood gases (0 = ³18, 1 = <18)
  20. CRE: Creatinine from initial blood gases (0 = ²2.0, 1 = >2.0)
  21. LOC: Level of consciousness at admission (0 = no coma or stupor, 1= deep stupor, 2 = coma)

4 Reading in the ICU Admissions Dataset

> ICU <- read.table("./ICUAdmissions.csv", header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

4.1 Structure of the ICU dataset

> str(ICU)
'data.frame':   200 obs. of  21 variables:
 $ ID           : int  8 12 14 28 32 38 40 41 42 50 ...
 $ Status       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Age          : int  27 59 77 54 87 69 63 30 35 70 ...
 $ Sex          : int  1 0 0 0 1 0 0 1 0 1 ...
 $ Race         : int  1 1 1 1 1 1 1 1 2 1 ...
 $ Service      : int  0 0 1 0 1 0 1 0 0 1 ...
 $ Cancer       : int  0 0 0 0 0 0 0 0 0 1 ...
 $ Renal        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Infection    : int  1 0 0 1 1 1 0 0 0 0 ...
 $ CPR          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Systolic     : int  142 112 100 142 110 110 104 144 108 138 ...
 $ HeartRate    : int  88 80 70 103 154 132 66 110 60 103 ...
 $ Previous     : int  0 1 0 0 1 0 0 0 0 0 ...
 $ Type         : int  1 1 0 1 1 1 0 1 1 0 ...
 $ Fracture     : int  0 0 0 1 0 0 0 0 0 0 ...
 $ PO2          : int  0 0 0 0 0 1 0 0 0 0 ...
 $ PH           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PCO2         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Bicarbonate  : int  0 0 0 0 0 1 0 0 0 0 ...
 $ Creatinine   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Consciousness: int  1 1 1 1 1 1 1 1 1 1 ...

The ICU Admissions dataset consists of 200 observations with 21 variables. From these observations we found;

  1. The dependent variable is the binary variable Vital Status (Status).
  2. Nineteen possible predictor variables, both discrete and continuous, were also observed.
  3. Most of the variables are integer but from information about the data, most of the variables can be recoded to factor variables. Several of the variables have two to three levels categories and so while they are numerical, their means do not offer any meaningful information.
  4. There are only four variables that can be left as numerical variables others can be recoded to categorical/factor variables.
  5. There are no missing data.

5 Recoding Variables

5.1 Converting Numerical Variables to Factor Variables

Attributes that could be recoded as factors included Status, Sex, Race, Service, Cancer, Renal, Infection, CPR, Previous, Type, Fracture, PCO2, PH, PO2, Bicarbonate, Creatinine and Consciousness. Labelling the factor levels with level names helps with comparative analysis and visualization.

> ICU <- within(ICU, {
+   Status <- factor(Status, labels=c('Lived','Died'))
+   Sex <- factor(Sex, labels=c('Male','Female'))
+   Race <- factor(Race, labels=c('White','Black','Other'))
+   Service <- factor(Service, labels=c('Medical','Surgical'))
+   Cancer <- factor(Cancer, labels=c('No','Yes'))
+   Renal <- factor(Renal, labels=c('No','Yes'))
+   Infection <- factor(Infection, labels=c('No','Yes'))
+   CPR <- factor(CPR, labels=c('No','Yes'))
+   Previous <- factor(Previous, labels=c('No','Yes'))
+   Type <- factor(Type, labels=c('Elective','Emergency'))
+   Fracture <- factor(Fracture, labels=c('No','Yes'))
+   PCO2 <- factor(PCO2, labels=c('No','Yes'))
+   PH <- factor(PH, labels=c('No','Yes'))
+   PO2 <- factor(PO2, labels=c('No','Yes'))
+   Bicarbonate <- factor(Bicarbonate, labels=c('No','Yes'))
+   Creatinine <- factor(Creatinine, labels=c('No','Yes'))
+   Consciousness <- factor(Consciousness, labels=c('Conscious','Deep Stupor','Coma'))
+ })

5.2 Recoding Numeric variables into bins

Several tests required recoding some of our numerical variables into bins. For Age, bins were created using equal count bins. Systolic was binned by blood pressure categories. HeartRate was also binned based on categories of bradycardia and elevated heart rate. the

Binning of the Age variable with equal-count bins:

> ICU$Age.binned <- 
+   with(ICU, binVariable(Age,
+    bins=5, method='proportions', 
+   labels=c('Group 1','Group 2','Group 3',
+   'Group 4','Group 5')))

Binning of the Systolic variable by hypotension and hypertension categories:

> ICU <- 
+   within(ICU, {
+   Systolic.grouped <- Recode(Systolic, 
+   '0:80="hypotension"; 80:120="normal"; 120:129="elevated"; 130:139="stage 1 hypertension"; 140:180="stage 2 hypertension"; 180:260="stage 3 hypertension"; ;',
+    as.factor=TRUE)
+ })

Binning of the HeartRate variable by groups of bradycardia, normal and elevated heart rate:

> ICU <- 
+   within(ICU, {
+   HeartRate.grouped <- Recode(HeartRate, 
+   '0:60="bradycardia"; 60:100="normal"; 100:200="elevated"',
+    as.factor=TRUE)
+ })

5.3 Saving Recoded Dataset

> # write.csv(ICU, file="ICUAdmissions_recoded.csv", row.names=FALSE)

6 Data Exploration - Statistical Summaries and Graphing of Variables

6.1 Preview of Recoded Dataset

Below is a table containing results from the first and last four variables in this dataset.

> headTail(ICU) %>% datatable(rownames = TRUE, filter="top", options = list(pageLenght = 10, scrollX=T))%>% formatRound(columns=c(1:17), digits=0)
> str(ICU)
'data.frame':   200 obs. of  24 variables:
 $ ID               : int  8 12 14 28 32 38 40 41 42 50 ...
 $ Status           : Factor w/ 2 levels "Lived","Died": 1 1 1 1 1 1 1 1 1 1 ...
 $ Age              : int  27 59 77 54 87 69 63 30 35 70 ...
 $ Sex              : Factor w/ 2 levels "Male","Female": 2 1 1 1 2 1 1 2 1 2 ...
 $ Race             : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 1 2 1 ...
 $ Service          : Factor w/ 2 levels "Medical","Surgical": 1 1 2 1 2 1 2 1 1 2 ...
 $ Cancer           : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
 $ Renal            : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Infection        : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 1 1 1 1 ...
 $ CPR              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Systolic         : int  142 112 100 142 110 110 104 144 108 138 ...
 $ HeartRate        : int  88 80 70 103 154 132 66 110 60 103 ...
 $ Previous         : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 1 ...
 $ Type             : Factor w/ 2 levels "Elective","Emergency": 2 2 1 2 2 2 1 2 2 1 ...
 $ Fracture         : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ PO2              : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ PH               : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ PCO2             : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Bicarbonate      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
 $ Creatinine       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ Consciousness    : Factor w/ 3 levels "Conscious","Deep Stupor",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Age.binned       : Factor w/ 5 levels "Group 1","Group 2",..: 1 3 5 2 5 4 3 1 1 4 ...
 $ Systolic.grouped : Factor w/ 6 levels "elevated","hypotension",..: 5 3 3 5 3 3 3 5 3 4 ...
 $ HeartRate.grouped: Factor w/ 3 levels "bradycardia",..: 3 3 3 2 2 2 3 2 1 2 ...

Findings

  • The following variables were successfully recoded; Status, Sex, Race, Service, Cancer, Renal, Infection, CPR, Previous, Type,Fracture, PCO2,PH, PO2, Bicarbonate, Creatinine, and Consciousness.

6.2 Statistical Summary of the Recoded Dataset

> summary(ICU)
       ID          Status         Age            Sex         Race    
 Min.   :  4.0   Lived:160   Min.   :16.00   Male  :124   White:175  
 1st Qu.:210.2   Died : 40   1st Qu.:46.75   Female: 76   Black: 15  
 Median :412.5               Median :63.00                Other: 10  
 Mean   :444.8               Mean   :57.55                           
 3rd Qu.:671.8               3rd Qu.:72.00                           
 Max.   :929.0               Max.   :92.00                           
     Service    Cancer    Renal     Infection  CPR         Systolic    
 Medical : 93   No :180   No :181   No :116   No :187   Min.   : 36.0  
 Surgical:107   Yes: 20   Yes: 19   Yes: 84   Yes: 13   1st Qu.:110.0  
                                                        Median :130.0  
                                                        Mean   :132.3  
                                                        3rd Qu.:150.0  
                                                        Max.   :256.0  
   HeartRate      Previous         Type     Fracture   PO2        PH     
 Min.   : 39.00   No :170   Elective : 53   No :185   No :184   No :187  
 1st Qu.: 80.00   Yes: 30   Emergency:147   Yes: 15   Yes: 16   Yes: 13  
 Median : 96.00                                                          
 Mean   : 98.92                                                          
 3rd Qu.:118.25                                                          
 Max.   :192.00                                                          
  PCO2     Bicarbonate Creatinine     Consciousness   Age.binned
 No :180   No :185     No :190    Conscious  :185   Group 1:41  
 Yes: 20   Yes: 15     Yes: 10    Deep Stupor:  5   Group 2:39  
                                  Coma       : 10   Group 3:43  
                                                    Group 4:44  
                                                    Group 5:33  
                                                                
             Systolic.grouped   HeartRate.grouped
 elevated            :17      bradycardia: 15    
 hypotension         :12      elevated   : 80    
 normal              :61      normal     :105    
 stage 1 hypertension:30                         
 stage 2 hypertension:65                         
 stage 3 hypertension:15                         

6.3 Numerical Summary of Integer Variables

> numSummary(ICU[,c("Age", "HeartRate", "Systolic"), drop=FALSE], statistics=c("mean", "sd", "IQR", 
+   "quantiles"), quantiles=c(0,.25,.5,.75,1))
             mean       sd   IQR 0%    25% 50%    75% 100%   n
Age        57.545 20.05465 25.25 16  46.75  63  72.00   92 200
HeartRate  98.925 26.82962 38.25 39  80.00  96 118.25  192 200
Systolic  132.280 32.95210 40.00 36 110.00 130 150.00  256 200

Findings

  • Age ranges from 16-92 years old with a mean of 57.55 and median of 63
  • Systolic blood pressure ranges from 36-256 mmHg with a mean of 132.3 and median of 130
  • Heart rate ranges from 39-192 beats per minute with a mean of 98.92 and median of 96

7 Data Exploration - Graphs

7.1 Examining the Variables Distribution

================================================================================

   Variable   p_1 p_10   p_25  p_50   p_75  p_90   p_99
1  Systolic 55.92   92    110   130    150   170 212.12
2       Age 16.99   21  46.75    63     72    78     91
3 HeartRate 45.98   65     80    96 118.25 136.1 162.08
4        ID 11.96 81.3 210.25 412.5 671.75 829.8 924.01

Using the xray package, general trends in the dataset were identified.

  1. There are more male observations in the dataset than females.

  2. Many of the admissions to the ICU were emergencies, with a about a quarter of admissions being elective. This could relate to elective surgical procedures where morbidity could have been high or complications occurred. It is unclear whether people would be preemptively admitted to the ICU for high-risk procedures or if these elective admissions could be thought of as unforeseen or emergencies in themselves.

  3. Looking at Status, more than 75% of observations lived after their ICU admission.

  4. While the numbers are roughly even, slightly more procedures were surgical.

  5. More admitted patients had no chronic renal failure, no previous admissions to the ICU, fracture, CPR or cancer when admitted. Most admitted patients had PO2 above or equal to 60, blood pH above or equal to 7.25, PCO2 below or equal to 45, and creatinine below or equal to 2. Much more of the observations were also white. Due to the large difference between groups some of these variables were not further explored, and the focus of exploratory data analysis and understanding of distribution was on variables where values might have been prior to admission. These might aid clinicians who work in the ICU in understanding the likelihood of survival for individuals who are admitted before lab work has been ordered.

  6. Most patients admitted were conscious, with slighly more patients being comatose than in a deep stupor if unconscious.

  7. Looking at histograms of the three numerical variables in the dataset, which were Age, Systolic and HeartRate, it could be guess that if any of the distributions were to be normal, they would be Systolic and HeartRate. Age is very obviously not normally distributed. Systolic looks like it is centered around 140, and the mean confirms this as mentioned above. The mean of HeartRate seems to be centered around 100 and its mean confirms this visual estimate.

7.2 Distribution of Vital Status by Factor Variables

> library(ggplot2)
> f01<-ggplot(ICU, aes(x=Sex, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Sex")
> 
> f02<-ggplot(ICU, aes(x=Race, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Race")
> f03<-ggplot(ICU, aes(x=Service, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Service")
> 
> f04<-ggplot(ICU, aes(x=Cancer, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Cancer")
> 
> library(Rmisc)
> multiplot(f01, f02, f03, f04,  layout=matrix(c(1:4), nrow=2, byrow=TRUE))

> f05<-ggplot(ICU, aes(x=Renal, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Renal")
> 
> f06<-ggplot(ICU, aes(x=Infection, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Infection")
> 
> f07<-ggplot(ICU, aes(x=CPR, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by CPR")
> 
> f08<-ggplot(ICU, aes(x=Previous, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Previous")
> 
> library(Rmisc)
> multiplot(f05, f06, f07, f08,  layout=matrix(c(1:4), nrow=2, byrow=TRUE))

> f09<-ggplot(ICU, aes(x=Type, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Type")
> 
> f10<-ggplot(ICU, aes(x=Fracture, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Fracture")
> 
> f11<-ggplot(ICU, aes(x=PO2, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by PO2")
> 
> f12<-ggplot(ICU, aes(x=PH, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by PH")
> 
> library(Rmisc)
> multiplot(f09, f10, f11, f12,  layout=matrix(c(1:4), nrow=2, byrow=TRUE))

> f13<-ggplot(ICU, aes(x=PCO2, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by PCO2")
> 
> f14<-ggplot(ICU, aes(x=Bicarbonate, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Bicarbonate")
> 
> f15<-ggplot(ICU, aes(x=Creatinine, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Creatinine")
> 
> f16<-ggplot(ICU, aes(x=Consciousness, fill = Status)) +
+  theme_bw() +
+  geom_bar() +
+  labs(y = "Patient Count",
+       title = "Vital Status by Consciousness")
> 
> 
> library(Rmisc)
> multiplot(f13, f14, f15, f16, layout=matrix(c(1:4), nrow=2, byrow=TRUE))

7.3 Observations

  • The number of patient who live is higher than the population who died
  • More Patients from the population aged 50 and above died in both male and female
  • The female population experienced more death than men of the same age bracket as
  • The density plot is peaked at the top at the age 75 for both male and female meaning more patients aged 75 died.

8 Uniform Distribution

The distribution of the ICU attributes will be explored below. Many of the attributes chosen to be tested here relate to the history of the individuals in the data set or the circumstances of the admission, rather than lab tests on admission. Most of the categorical attributes in this dataset have two categories, making the degrees of freedom 1 for chi squared distribution.

Theoretical chi square distribution with 1 degrees of freedom

Theoretical chi square distribution with 1 degrees of freedom

For Categorical variables with two categories:
H0: the attribute is distributed uniformally, the chi squared value does not exceed the hypothesized value of 3.841 (p = 0.05)
Ha: the attribute is not distiributed uniformally, The chi squared value exceeds the hypothesized value of 3.841

Chi squared test - Status:

The chi squared test for Status returns a chi squared value above the null hypothesized value with a very small p value (<0.05), indicating that there is a very low risk of Type I error if the null hypothesis is rejected. It can be concluded that Status is not uniformally distributed and that statistically more individuals lived than died on admission to the ICU.

> local({
+   .Table <- with(ICU, table(Status))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Status
Lived  Died 
  160    40 

percentages:
Status
Lived  Died 
   80    20 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 72, df = 1, p-value < 2.2e-16

Chi squared test - CPR:

Similarly, the chi squared test for CPR returns a very high chi squared value at 151, and a low p value, allowing us to reject the null hypothesis that this attribute is uniformally distributed. More individuals did not have CPR than did before ICU admission. In fact, as seen from the table below, few individuals received CPR at all, which might make it a poor predictor of ICU outcomes.

> local({
+   .Table <- with(ICU, table(CPR))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
CPR
 No Yes 
187  13 

percentages:
CPR
  No  Yes 
93.5  6.5 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 151.38, df = 1, p-value < 2.2e-16
> table(ICU$CPR, ICU$Status)
     
      Lived Died
  No    154   33
  Yes     6    7

Chi squared test - Infection:

The chi squared value produced from this test is only slightly higher than the null hypothesized value, with a p value of 0.024, indicating a 2.4% chance of a Type I error if the null hypothesis is rejected. While the distribution can be concluded to not be uniform based on the threshold set with the null hypothesis, the uniformity of Infection among patients admitted to the ICU could be explored in further research. It should be noted that in the group of individuals who died, more had infections than not.

> local({
+   .Table <- with(ICU, table(Infection))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Infection
 No Yes 
116  84 

percentages:
Infection
 No Yes 
 58  42 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 5.12, df = 1, p-value = 0.02365
> table(ICU$Infection, ICU$Status)
     
      Lived Died
  No    100   16
  Yes    60   24

Chi squared test - Previous:

The chi squared value here exceeds the null hypothesized value with a p value much smaller than 0.05, allowing us to reject the null hypothesis that this attribute is uniformally distributed. More individuals in both status categories did not have a previous ICU admission.

> local({
+   .Table <- with(ICU, table(Previous))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Previous
 No Yes 
170  30 

percentages:
Previous
 No Yes 
 85  15 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 98, df = 1, p-value < 2.2e-16

Chi squared test - Sex:

The chi squared value returned from this test is 11.52, which is not as far from the null hypothesized value as some of the other values generated from other tests. The value and p value of 0.00069 still indicated that the null hypothesis should be reject and that Sex is not uniformally distributed. There are more men in this dataset than females.

> local({
+   .Table <- with(ICU, table(Sex))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Sex
  Male Female 
   124     76 

percentages:
Sex
  Male Female 
    62     38 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 11.52, df = 1, p-value = 0.0006885

In both sex categories, despite the inequality in the number of observations in each group, there are still more individuals who lived than died.

Chi squared test - Type:

The distribution of Type can be concluded to be non-uniform as the null hypothesis should be rejected based on the chi squared value produced and low p value. As seen in exploratory data analysis, the there are more individuals who are had emergency admissions than elective admissions. This difference in distribution is statistically significant. Fewer individuals died when they were admitted to the ICU on an elective basis.

> local({
+   .Table <- with(ICU, table(Type))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Type
 Elective Emergency 
       53       147 

percentages:
Type
 Elective Emergency 
     26.5      73.5 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 44.18, df = 1, p-value = 2.995e-11

Chi squared test - Service:

The distribution of Service is uniform based on the null hypothesis of a chi squared variable of 3.841. The chi squared value here indicates that there is not a significant difference between a uniform distribution and the distributio not Service. This means that there was no significant difference between the number of individuals admitted who had medical or surgical services performed, as the p value indicates a 32% risk of incorrectly rejecting the null hypothesis.

> local({
+   .Table <- with(ICU, table(Service))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.5,0.5) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Service
 Medical Surgical 
      93      107 

percentages:
Service
 Medical Surgical 
    46.5     53.5 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 0.98, df = 1, p-value = 0.3222

For Categorical variables with three categories:
H0: the attribute is distributed uniformally, the chi squared value does not exceed the hypothesized value of 5.992 (p = 0.05)
Ha: the attribute is not distiributed uniformally, The chi squared value exceeds the hypothesized value of 5.992 (p=0.05)

Theoretical chi square distribution with 2 degrees of freedom

Theoretical chi square distribution with 2 degrees of freedom

Chi squared test - Consciousness:

The null hypothesis for this attribute’s distribution is that an equal number of observations will be found in each category of Consciousness. The chi squared value produced by this test exceeds the null hypothesized value with a low p value, indicating that the observations in this category are statistically not uniform. This could have been guessed from counts of observations in each category, where most individuals were in the conscious category.

> local({
+   .Table <- with(ICU, 
+   table(Consciousness))
+   cat("\ncounts:\n")
+   print(.Table)
+   cat("\npercentages:\n")
+   print(round(100*.Table/sum(.Table), 
+   2))
+   .Probs <- c(0.333333333333333,
+   0.333333333333333,0.333333333333333) 
+   chisq.test(.Table, p=.Probs)
+ })

counts:
Consciousness
  Conscious Deep Stupor        Coma 
        185           5          10 

percentages:
Consciousness
  Conscious Deep Stupor        Coma 
       92.5         2.5         5.0 

    Chi-squared test for given probabilities

data:  .Table
X-squared = 315.25, df = 2, p-value < 2.2e-16

9 Normal Distribution

Distribution of the three numerical variables in the ICU dataset was evaluated using quantile-quantile plots and the Shapiro-Wilk’s test for normality. Confidence intervals will be produced for the means of each attribute.

H0: Age, HeartRate and Systolic are normally distributed
Ha: Age, HeartRate and Systolic are not normally distributed

Tests of normality - Age:

Recall that a quantile-quantile plot should produce a nearly linear plot using dataset values, with the intercept going through zero, if the null hypothesis is satisfied. Below, the plotted points of Age do not conform well to the line of best fit, and are frequently outside of the confidence bands. This supports what might have been hypothesized when the histogram of Age was produced: this distribution is not clearly centered around a mean. This plot suggests that the null hypothesis that Age is normally distributed should be rejected.

> with(ICU, qqPlot(Age, dist="norm", id=list(method="y", n=2, labels=rownames(ICU)), main="QQ plot of Age"))

[1] 23 97

Using the Shapiro-Wilk normality test, the conclusions drawn from the QQ plot can be supported, with a very low risk of Type I error, base on this p value. The Shapiro Wilk test is sensitive when used on larger datasets, so this result should be taken into account with the other tests used.

> normalityTest(~Age, test="shapiro.test", data=ICU)

    Shapiro-Wilk normality test

data:  Age
W = 0.92836, p-value = 2.507e-08

Using the t-test, given the confidence interval for this attribute is between 54.75 and 60.34, an interval that does not contain 0. (choosing not to add this as I do not have a test mean - need to refresh on what a true mean is )

Tests of normality - HeartRate:

The QQ plot of HeartRate appears to adhere well to the line of best fit, despite several points in the middle of the graph lying along one side of the confidence band. Most of the points here fall within the confidence band. There is a positive skew of this data towards the lower values of HeartRate. Visually, one might conclude that this data is normally distributed.

> with(ICU, qqPlot(HeartRate, dist="norm", id=list(method="y", n=2, labels=rownames(ICU)), main="QQ plot of HeartRate"))

[1] 125  48

The results of the Shapiro-Wilk test very narrowly allow one to reject the null hypothesis of normal distribution. However the p value is very close to the threshold p value of 0.05, which decreases the confidence that might be had to reject normality based on this test and the sample size used.

> normalityTest(~HeartRate, test="shapiro.test", data=ICU)

    Shapiro-Wilk normality test

data:  HeartRate
W = 0.98598, p-value = 0.04478

Test of normality - Systolic:

Similar to the plot of HeartRate, the QQ plot of Systolic displays points that line closely on the line of best fit. The confidence bands here are quite narrow, and it is only along the ends oof the plot the points very clearly exit the confidence bands. One might conclude that the null hypothesis could be rejected for the normal distribution of Systolic based on this plot. There is a very slight negative skew here towards higher values of Systolic.

> with(ICU, qqPlot(Systolic, dist="norm", id=list(method="y", n=2, labels=rownames(ICU)), main="QQ Plot of Systolic"))

[1] 200 179

While more statistically significant compared to the Shapiro Wilk test for normality of the HeartRate distribution, the p value here is not far from 0.05, indicating a 2% risk of Type I error. Based on this test, the null hypothesis could be rejected. The distribution of Systolic appears to be non-normal.

> normalityTest(~Systolic, test="shapiro.test", data=ICU)

    Shapiro-Wilk normality test

data:  Systolic
W = 0.98369, p-value = 0.0204

In reference to Ian Fellows’ comments about tests for normality, these tests for normality do not add much more to understanding of the ICU dataset and the contributions of these three numerical variables to the determination of Status, beyond what is provided by the xray package. In the same vein, as the central limit theorem and bootstrapping have shown, most distributions can become normal with repeated sampling.

10 Relationships with Age

The mean of Age, referring to the descriptive statistics for this attributes, is about 58 years. Visually, it can be seen that the distribution of Age has two peaks, one around 20 years of age and one around 70 years of age. The majority of subjects admitted to the ICU seem to be between 40-80 years old.

> ggplot(ICU, aes(x=Age)) +
+  geom_density(fill="green") +
+  ggtitle("Age") +
+  theme(axis.title.x=element_text(size=16, face="bold", colour="blue")) +
+  theme(axis.text.x=element_text(size=14 )) +
+  annotate("text", x=0.8, y=-0.001, label="Base=315",  size=4)

Critical Values for Difference in Means
Critical values to assess differences in means are based on a degrees of freedom of 199.

> qt(c(0.025), df=199, lower.tail=TRUE)
[1] -1.971957
> qt(c(0.025), df=199, lower.tail=FALSE)
[1] 1.971957

10.1 Cancer by Age

There are three peaks for Ages where ICU admissions involved cancer. These are around 20, 50 and 70 years of age. Given the relatively small number of individuals who had cancer involvement with their admission, and the variability in ages associated, cancer might not be the best predictor of ICU admission.

> ggplot(ICU, aes(x=Age, fill = Cancer)) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Cancer by Age")

Difference in Means

H0: the variances for admissions involving cancer and admission not involving cancer are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances of Age by Cancer.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Cancer, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
        mean       sd se(mean) Age:n
No  57.70556 20.30953 1.513783   180
Yes 56.10000 17.99971 4.024857    20

Since the p-value = 0.5684 for testing the homogeneity of variances is greater than 0.05, we retain the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for admissions involving cancer vs those that do not are equal. As such, the Student t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Cancer, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value Pr(>F)
group   1  0.3265 0.5684
      198               

As the p-value = 0.735, 0 is within the confidence intervals of -7.736825 to 10.947937 and t = 0.33891 is less than 1.971957, we retain the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for age are the same between the two groups.

> t.test(Age~Cancer, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=ICU)

    Two Sample t-test

data:  Age by Cancer
t = 0.33891, df = 198, p-value = 0.735
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -7.736825 10.947937
sample estimates:
 mean in group No mean in group Yes 
         57.70556          56.10000 

10.2 Infection by Age

ICU admissions that involved infection centered around 60-65 years of age. This association is relatively clear visually looking at the plot below.

> ggplot(ICU, aes(x=Age, fill = Infection )) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Infection by Age")

Difference in Means

H0: the variances for admissions involving infection and those that do not are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances of Age by Infection.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Infection, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
        mean       sd se(mean) Age:n
No  54.93103 21.21842 1.970081   116
Yes 61.15476 17.82545 1.944917    84

Since the p-value = 0.04162 for testing the homogeneity of variances is less than 0.05, we reject the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for admissions involving infection and admissions not involving infection are not equal. As such, the Welch two sample t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Infection, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value  Pr(>F)  
group   1  4.2053 0.04162 *
      198                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value = 0.02569, 0 is not within the confidence intervals of -11.6837808 to -0.7636741 and t = -2.2481 is less than -1.971957, we reject the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for age are not the same between the two groups. this suggests that there may be a relationship between Age and Infection.

> t.test(Age~Infection, alternative="two.sided", conf.level=.95, var.equal=FALSE, data=ICU)

    Welch Two Sample t-test

data:  Age by Infection
t = -2.2481, df = 193.6, p-value = 0.02569
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.6837808  -0.7636741
sample estimates:
 mean in group No mean in group Yes 
         54.93103          61.15476 

10.3 Consciousness by Age

Another association that was explored was between Age and Consiousness. All three categories for Consciousness overlap with peaks between 50 and 75 years of age. There is a distinct peak at around 50 for deep stupor, indicating that this age might be associated with deep stupor when related to ICU admissions. However, deep stupor also has a secondary peak at around 75, which make it less clearly how this consciousness category relates to age in this dataset.

> ggplot(ICU, aes(x=Age, fill = Consciousness )) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Consciousness by Age")

Difference in Means (ANOVA)

H0: the variances for Age is the same across all levels of consciousness
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances of Age by Consciousness.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Consciousness, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
                mean       sd se(mean) Age:n
Conscious   57.02162 20.50927 1.507871   185
Deep Stupor 61.20000 11.16692 4.993996     5
Coma        65.40000 12.50067 3.953058    10

Testing normality
H0: the several distributions are normally distributed
Ha: the distributions are not normally distributed

According the q-q plot and Shapiro-Wilk normality test for each of the three levels of consciousness, Age does not appear to be normally distributed for the level of consciousness = conscious. This was found as the q-q plot for conscious does not follow the diagonal line and the p-value = 5.319e-08 which is less than 0.05. Additionally, based on the normality test previously performed on Age, the p-value = 2.507e-08 which is less than 0.05. As such, we would reject the null hypothesis at a 5% risk and conclude that the distributions are not normally distributed. That being said, since normality was rejected, the Kruskal-Wallis test should be used to determine whether the four distributions are the same.

> with(ICU, qqPlot(Age, dist="norm", id=list(method="y", n=2, 
+   labels=rownames(ICU)), groups=Consciousness))

> normalityTest(Age ~ Consciousness, test="shapiro.test", data=ICU)

 --------
 Consciousness = Conscious 

    Shapiro-Wilk normality test

data:  Age
W = 0.92706, p-value = 5.319e-08

 --------
 Consciousness = Deep Stupor 

    Shapiro-Wilk normality test

data:  Age
W = 0.9012, p-value = 0.4165

 --------
 Consciousness = Coma 

    Shapiro-Wilk normality test

data:  Age
W = 0.96128, p-value = 0.8003

 --------

 p-values adjusted by the Holm method:
            unadjusted adjusted  
Conscious   5.3186e-08 1.5956e-07
Deep Stupor 0.41654    0.83308   
Coma        0.80031    0.83308   

Testing for Homogeneity of Variances
H0: variances are homogenous
Ha: at least one of the variances is different from the others

As the Levene Test generated a p-value = 0.0942, we would retain the null hypothesis with a 5% risk of a Type 1 error. This means that the difference in variances of Age for the different levels of consciousness is zero, and that the variances are homogeneous.

> leveneTest(Age ~ Consciousness, data=ICU , center=median)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value Pr(>F)  
group   2  2.3909 0.0942 .
      197                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non-parametric Kruskal-Wallis test
As ‘normality’ was rejected, the Kruskal-Wallis test will be used in order to determine whether the three distributions for Age are the same.

The Kruskal-Wallis test produced a p-value of p-value = 0.6468, which is greater than 0.05. As such, we would retain the null hypothesis and conclude that the distributions are the same and that the means are likely the same.

> kruskal.test(Age ~ Consciousness, data=ICU)

    Kruskal-Wallis rank sum test

data:  Age by Consciousness
Kruskal-Wallis chi-squared = 0.87135, df = 2, p-value = 0.6468

10.4 Type by Age

Both admission Types, elective and emergency, happen more frequently with advancing age. However, elective admissions happen more frequently with older age. This might suggest that older individuals are more likely to have complications during surgical or other procedures that would require admissions to an intensive care unit. Depending on organizational policy or practice, this might also mean that individuals at the study site were premptively admitted to ICU, perhaps more frequently with advancing age, for fear of complications.

> ggplot(ICU, aes(x=Age, fill = Type )) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Type by Age")

Difference in Means

H0: the variances for elective and emergency are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances of Age by Type.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Type, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
              mean       sd se(mean) Age:n
Elective  63.77358 15.49769 2.128772    53
Emergency 55.29932 21.05909 1.736924   147

Since the p-value = 0.001188 for testing the homogeneity of variances is less than 0.05, we reject the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for elective and emergency admission types are not equal. As such, the Welch two sample t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Type, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value   Pr(>F)   
group   1  10.821 0.001188 **
      198                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value = 0.002513, 0 is not within the confidence intervals of 3.036522 to 13.912009 and t = 3.0844 is greater than 1.971957, we reject the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for age are not the same between the two groups. This further confirms our observations from the graph that there is a difference between the two. Additionally, a relationship may exist between Age and admission type.

> t.test(Age~Type, alternative="two.sided", conf.level=.95, var.equal=FALSE, data=ICU)

    Welch Two Sample t-test

data:  Age by Type
t = 3.0844, df = 124.61, p-value = 0.002513
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  3.036522 13.912009
sample estimates:
 mean in group Elective mean in group Emergency 
               63.77358                55.29932 

10.5 Service by Age

While the peaks for surgical and medical Service by Age overlap considerably, there is interestingly a peak of surgical service type in younger age groups, around 20 years of age. While it is not known what conditions led to these individuals being admitted, it might be concluded that older individuals would have more events that were related to their medication or needing medication, for example falls or stroke. This is a very weak conclusion to draw from the graph below, but if specific reasons for admission were also collected, this theory could be explored.

> ggplot(ICU, aes(x=Age, fill = Service)) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Service by Age")

Difference in Means

H0: the variances for medical and surgical are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances of Age by Service.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Service, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
             mean       sd se(mean) Age:n
Medical  57.45161 18.56511 1.925112    93
Surgical 57.62617 21.35174 2.064150   107

Since the p-value = 0.2151 for testing the homogeneity of variances is greater than 0.05, we retain the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for medical and surgical are equal. As such, the Student two t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Service, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value Pr(>F)
group   1  1.5464 0.2151
      198               

As the p-value = 0.9512, 0 is within the confidence intervals of -5.795344 to 5.446234 and t = -0.061242 is greater than -1.971957, we retain the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for age are the same between the two groups.

> t.test(Age~Service, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=ICU)

    Two Sample t-test

data:  Age by Service
t = -0.061242, df = 198, p-value = 0.9512
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -5.795344  5.446234
sample estimates:
 mean in group Medical mean in group Surgical 
              57.45161               57.62617 

10.6 Previous by Age

There is a definite an almost independent peak of history of previous ICU admission with older age. It can be noted that there is a smaller peak of relatively older individuals who were admitted who had not had a previous ICU admission within the past 6 months. This indicates that old age does not relate clearly to previous ICU admission and future ICU admission. On the extremes of age, younger individuals are shown to have less frequency of previous ICU admission within the past 6 months, while there is a small peak of older admission of the individuals in this dataset having had a previous ICU admission in the past 6 months. There might be a smaller increased likelihood of ICU admission with previous ICU admission in the past 6 months with age.

> ggplot(ICU, aes(x=Age, fill = Previous )) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Previous by Age")

Difference in Means

H0: the variances for previous admission within the past 6 months and no previous admission within the past 6 months are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances of Age by Previous.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Previous, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
        mean       sd se(mean) Age:n
No  56.96471 20.69528 1.587256   170
Yes 60.83333 15.83554 2.891161    30

Since the p-value = 0.05185 for testing the homogeneity of variances is greater than 0.05, we retain the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for previous admission within the past 6 months and no previous admission within the past 6 months are equal. As such, the Student two t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Previous, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value  Pr(>F)  
group   1  3.8268 0.05185 .
      198                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value = 0.3312, 0 is within the confidence intervals of -11.701332 to 3.964077 and t = -0.97399 is greater than -1.971957, we retain the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for age are the same between the two groups.

> t.test(Age~Previous, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=ICU)

    Two Sample t-test

data:  Age by Previous
t = -0.97399, df = 198, p-value = 0.3312
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.701332   3.964077
sample estimates:
 mean in group No mean in group Yes 
         56.96471          60.83333 

10.7 Sex by Age

From the plot below, it seems that females who were admitted to the ICU were older than their male counterparts. There are similar peaks and patterns for younger individuals: females were older if admitted to ICU than their male counterparts.

> ggplot(ICU, aes(x=Age, fill = Sex)) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Sex by Age")

Difference in Means
H0: the variances for Males and Females are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances between the two Males and Females for Age.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Sex, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
           mean       sd se(mean) Age:n
Male   56.04032 19.45451 1.747067   124
Female 60.00000 20.89466 2.396781    76

Since the p-value = 0.7344 for testing the homogeneity of variances is greater than 0.05, we retain the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for Males and Females are equal. As such, the Student t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Sex, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value Pr(>F)
group   1  0.1154 0.7344
      198               

As the p-value = 0.1759, 0 is within the confidence intervals of -9.708824 to 1.789469 and t = -1.3582 is greater than -1.971957, we retain the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for Age are the same between the two groups. This further confirms our observations from the graph that the peaks of the two sexes are similar.

> t.test(Age~Sex, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=ICU)

    Two Sample t-test

data:  Age by Sex
t = -1.3582, df = 198, p-value = 0.1759
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.708824  1.789469
sample estimates:
  mean in group Male mean in group Female 
            56.04032             60.00000 

10.8 Vital Status by Age

Overlaying Status and Age, density plots show that deaths after admission follow the larger Age peak that contains middle aged to elderly individuals.

> ggplot(ICU, aes(x=Age, fill = Status)) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Vital Status by Age")

When the additional layer of Status, more information can be gathered from the relationship between Age and Sex. When looking at males, older males more often died than younger males, with similar frequency in advanced age, around the age of 65-70. In younger males, those admitted to the ICU were more often to have lived through their admission, where the opposite is seen in older males.

In females, there is a clearer distribution of older individuals who had died. No females in this dataset under the age of about 40 died on ICU admission. There is a clear peak at around 75 years and a smaller peak at around 50 of females who died on ICU admission. There is a higher frequency of younger females who lived through ICU admission, but there remains a higher frequency of women who were older in this group.

> ggplot(ICU, aes(x=Age, fill = Status)) +
+    theme_bw() +
+    facet_wrap(~ Sex) +
+        geom_density(alpha=0.5) +
+    labs(y = "Density",
+         title = "Density distribution of Vital Status in male and female patients by Age")

Difference in Means
H0: the variances for Lived and Died are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances between the two Status groups for Age.

> numSummary(ICU[,c("Age"), drop=FALSE], groups=ICU$Status, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
        mean       sd se(mean) Age:n
Lived 55.650 20.42818 1.614990   160
Died  65.125 16.64900 2.632438    40

Since the p value = 0.07855 for testing the homogeneity of variances is greater than 0.05, we retain the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for Lived and Died are equal. As such, the Student t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Age ~ Status, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value  Pr(>F)  
group   1   3.127 0.07855 .
      198                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value = 0.007211, 0 is not within the confidence intervals of -16.35688 to -2.59312 and t = -2.7151 is less than -1.971957, we reject the null hypothesis and conclude that the means for Age for the groups Lived and Died are not the same. This suggests that there might be a relationship between Age and Vital Status.

> t.test(Age~Status, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=ICU)

    Two Sample t-test

data:  Age by Status
t = -2.7151, df = 198, p-value = 0.007211
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16.35688  -2.59312
sample estimates:
mean in group Lived  mean in group Died 
             55.650              65.125 

11 Relationships with HeartRate

Using a density estimate plot, the distribution of HeartRate was produced. The main peak here is at about 90 beats per minute. There appears to be a secondary peak around 130 beats per minute. The mean for HeartRate calculated by Rcmdr is 99. Looking at the data, this mean might be said to not accurately portray the frequency of values for HeartRate.

>  ggplot(ICU, aes(x=HeartRate)) +
+  geom_density(fill="green") +
+  ggtitle("HeartRate") +
+  theme(axis.title.x=element_text(size=16, face="bold", colour="blue")) +
+  theme(axis.text.x=element_text(size=14 )) +
+  annotate("text", x=0.8, y=-0.001, label="Base=315",  size=4)

11.1 Vital Status by Heart Rate

Both status categories have similar peaks when plotted on a density estimate plot of HeartRate. This indicates that there might not be a difference in the Status response category based on HeartRate. People who lived did seem to have a higher density of HeartRate values around 90 than those who died. A normal adult heart rate ranges between 60 and 100 beats per minute. This graph might suggest that people who lived more often had heart rates within this range, whereas those who died seemed more likely to have higher heartrates. This could inform further analysis or research.

> ggplot(ICU, aes(x=HeartRate, fill = Status)) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Vital Status by HeartRate")

Difference in Means
H0: the variances for Lived and Died are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of means and variances between the two Status groups for HeartRate.

> numSummary(ICU[,c("HeartRate"), drop=FALSE], groups=ICU$Status, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
         mean       sd se(mean) HeartRate:n
Lived  98.500 26.97868 2.132852         160
Died  100.625 26.49304 4.188918          40

Since the p-value = 0.929 for testing the homogeneity of variances is greater than 0.05, we retain the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for Lived and Died are equal. As such, the Student t-test is used to analyze whether there was a significant difference in means.

> leveneTest(HeartRate ~ Status, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value Pr(>F)
group   1   0.008  0.929
      198               

As the p-value = 0.6553, 0 is within the confidence intervals of -11.496845 to 7.246845 and t = -0.44714 is greater than -1.971957, we retain the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for HeartRate are the same among those that lived and those that died. While observations from the graph suggest that people who lived more often had heart rates that were lower, based on the t-test, the means of the two groups are not statistically different.

> t.test(HeartRate~Status, alternative='two.sided', conf.level=.95, var.equal=TRUE, data=ICU)

    Two Sample t-test

data:  HeartRate by Status
t = -0.44714, df = 198, p-value = 0.6553
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -11.496845   7.246845
sample estimates:
mean in group Lived  mean in group Died 
             98.500             100.625 

12 Relationships with Systolic

Below is a density plot of Systolic blood pressure in mmHg. A peak appears around 130-140mmHg. The peak is quite narrow and distinct compared to the density plots of Age and HeartRate, suggesting that the majority of individuals had blood pressure readings that are reflective of this plot. The mean for this attribute was 132mmHg, which seems to be more valid than the means of the other numeric variables, based on the distribution of the frequency of values.

> ggplot(ICU, aes(x=Systolic)) +
+  geom_density(fill="green") +
+  ggtitle("Systolic") +
+  theme(axis.title.x=element_text(size=16, face="bold", colour="blue")) +
+  theme(axis.text.x=element_text(size=14 )) +
+  annotate("text", x=0.8, y=-0.001, label="Base=315",  size=4)

12.1 Vital Status by Systolic

The distribution of Systolic values for those who died seem to be concentrated around 75 mmHg and 140mmHg. A normal Systolic blood pressure can vary widely but the American Heart Association states that blood pressure below 120mmHg is normal. However, excessively low blood pressure could be a result of bleeding, for example, and can result in insufficient blood flow to critical organs. Medications used to restore blood pressure are used when blood pressure becomes too low. This might explain the peak around 75mmHg in the death curve in the graph below.

> ggplot(ICU, aes(x=Systolic, fill = Status)) +
+   theme_bw() +
+       geom_density(alpha=0.5) +
+   labs(y = "Density",
+        title = "Density distribution of Vital Status Systolic")

Difference in Means
H0: the variances for Lived and Died are equal
Ha: the variances are different

H0: the means are equal
Ha: the means are different

The following is the comparison of variances between the two Status groups for Systolic.

> numSummary(ICU[,c("Systolic"), drop=FALSE], groups=ICU$Status, statistics=c("mean","sd", "se(mean)"), quantiles=c(0,.25,.5,.75,1))
          mean       sd se(mean) Systolic:n
Lived 135.6438 29.80151 2.356016        160
Died  118.8250 41.08084 6.495451         40

Since the p-value = 0.04205 for testing the homogeneity of variances is less than 0.05, we reject the null hypothesis with a 5% risk of a type 1 error and conclude that the variances for Lived and Died are not equal. As such, the Welch two Sample t-test is used to analyze whether there was a significant difference in means.

> leveneTest(Systolic ~ Status, data=ICU, center="median")
Levene's Test for Homogeneity of Variance (center = "median")
       Df F value  Pr(>F)  
group   1  4.1872 0.04205 *
      198                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the p-value = 0.01856, 0 is not within the confidence intervals of 2.938642 to 30.698858 and t = 2.4341 is greater than 1.971957, we reject the null hypothesis at a 5% risk level of a type 1 error and conclude that the means for systolic blood pressure are not the same among those that lived and those that died. As such, this might suggest that patients who lived had different systolic blood pressures compared to those that died. Therefore a relationship might exist between systolic blood pressure and vital status.

> t.test(Systolic~Status, alternative="two.sided", conf.level=.95, var.equal=FALSE, data=ICU)

    Welch Two Sample t-test

data:  Systolic by Status
t = 2.4341, df = 49.726, p-value = 0.01856
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  2.938642 30.698858
sample estimates:
mean in group Lived  mean in group Died 
           135.6438            118.8250 

13 Difference in proportions

We will conduct z-tests between two categorical variables where the dependent variable is Vital status and the independent variable is binary.

13.1 Vital status by Service

H0: there is no difference between the proportion of patients that died in the medical versus surgical group, risk = 0.05 Ha: there is a difference between the proportion of patients that died in the medical versus surgical group

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Service, show.col.prc = TRUE)
Status Service Total
Medical Surgical
Lived 67
72 %
93
86.9 %
160
80 %
Died 26
28 %
14
13.1 %
40
20 %
Total 93
100 %
107
100 %
200
100 %
χ2=5.981 · df=1 · φ=0.185 · p=0.014
> x1<-26; x2<-14; n1<-93; n2<-107
> p1<-x1/n1; p2<-x2/n2
> p<-(x1+x2)/(n1+n2)  
> varp<-p*(1-p)*(1/n1 + 1/n2)  
> stdp<-sqrt(varp)  
> zp<-(p1 - p2)/stdp 
> zp
[1] 2.622729
> 1-pnorm(zp)
[1] 0.004361436

When conducting a two-tail z-test with a 5% level of risk of a type 1 error, the critical values are -1.96 and +1.96. The z-value of 2.622729 does not fall within this confidence interval, thus we reject the null hypothesis that there is no difference between the proportion of patients dying in the ICU from medical procedures versus surgical procedures.

Further, since the p-value of 0.004361436 is less than 0.05, we, again, reject the null hypothesis that the proportions of the two groups are equal.

Lastly, we can see from the table that more of those that died were there for a medical service and more of those that lived were there for a surgical service.

13.2 Vital status by Cancer

H0: there is no difference between the proportions of patients that died in the cancer versus non-cancer group, risk = 0.05 Ha: there is a difference between the proportions of patients that died in the cancer versus non-cancer group

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Cancer, show.col.prc = TRUE)
Status Cancer Total
No Yes
Lived 144
80 %
16
80 %
160
80 %
Died 36
20 %
4
20 %
40
20 %
Total 180
100 %
20
100 %
200
100 %
χ2=0.000 · df=1 · φ=0.000 · Fisher’s p=1.000
> x1<-36; x2<-4; n1<-180; n2<-20
> p1<-x1/n1; p2<-x2/n2
> p<-(x1+x2)/(n1+n2)  
> varp<-p*(1-p)*(1/n1 + 1/n2)  
> stdp<-sqrt(varp)  
> zp<-(p1 - p2)/stdp 
> zp
[1] 0
> 1-pnorm(zp)
[1] 0.5

When conducting a two-tail z-test with a 5% level of risk of a type 1 error, the critical values are -1.96 and +1.96. The z-value of 0 falls within this confidence interval, thus we fail to reject the null hypothesis that there is no difference between the proportions of patients that died with versus without cancer as part of the present problem.

Further, since the p-value of 0.5 is greater than 0.05, we, again, fail to reject the null hypothesis that the proportions of the Cancer versus non-cancer patients that died are equal.

Thus, there are equal proportions of those who died with cancer as part of the present problem and without, as well as equal proportions of those who lived with cancer as part of the present problem and without.

13.3 Vital status by Previous

H0: there is no difference between the proportion of patients that died with versus without previous admission to an ICU within 6 months, risk = 0.05 Ha: there is a difference between the proportion of patients that died with versus without previous admission to an ICU within 6 months

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Previous, show.col.prc = TRUE)
Status Previous Total
No Yes
Lived 137
80.6 %
23
76.7 %
160
80 %
Died 33
19.4 %
7
23.3 %
40
20 %
Total 170
100 %
30
100 %
200
100 %
χ2=0.061 · df=1 · φ=0.035 · Fisher’s p=0.624
> x1<-33; x2<-7; n1<-170; n2<-30
> p1<-x1/n1; p2<-x2/n2
> p<-(x1+x2)/(n1+n2)  
> varp<-p*(1-p)*(1/n1 + 1/n2)  
> stdp<-sqrt(varp)  
> zp<-(p1 - p2)/stdp 
> zp
[1] -0.4950738
> 1-pnorm(zp)
[1] 0.689726

When conducting a two-tail z-test with a 5% level of risk of a type 1 error, the critical values are -1.96 and +1.96. The z-value of -0.4950738 falls within this confidence interval, thus we fail to reject the null hypothesis that there is no difference between the proportion of patients that died with versus without previous admission to an ICU within 6 months of the current admission.

Further, since the p-value of 0.689726 is greater than 0.05, we, again, fail to reject the null hypothesis that there is no difference between the proportion of patients that died whether or not they were previously admitted to the ICU within the past 6 months.

Lastly, we see from the table that there was a lower percentage of patients that died with no prior ICU admission than patients that died with prior ICU admission.

13.4 Vital status by Type

Ho: there is no difference between the proportion of patients that died in the elective admission group versus the emergency admission group, risk = 0.05 Ha: there is a difference between the proportion of patients that died in the elective admission group versus the emergency admission group

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Type, show.col.prc = TRUE)
Status Type Total
Elective Emergency
Lived 51
96.2 %
109
74.1 %
160
80 %
Died 2
3.8 %
38
25.9 %
40
20 %
Total 53
100 %
147
100 %
200
100 %
χ2=10.527 · df=1 · φ=0.244 · p=0.001
> x1<-2; x2<-38; n1<-53; n2<-147
> p1<-x1/n1; p2<-x2/n2
> p<-(x1+x2)/(n1+n2)  
> varp<-p*(1-p)*(1/n1 + 1/n2)  
> stdp<-sqrt(varp)  
> zp<-(p1 - p2)/stdp 
> zp
[1] -3.444743
> pnorm(zp)
[1] 0.000285801

When conducting a two-tail z-test with a 5% level of risk of a type 1 error, the critical values are -1.96 and +1.96. The z-value of -3.444743 does not fall within this confidence interval, thus we reject the null hypothesis that there is no difference between the proportion of patients that died in the elective admission group versus the emergency admission group.

Further, since the p-value of 0.000285801 is less than 0.05, we, again, reject the null hypothesis that the proportions of the two groups are equal.

We can see from the table that the emergency admission type had a higher proportion of deaths than the elective admission type did.

We will now conduct a prop.test for categorical variables where the independent variable has more than 2 subgroups.

13.5 Vital status by Age

H0: there is no difference between the proportion of patients that died across the 5 age groups, risk = 0.05 Ha: the proportion of patients that died is different in at least one of the 5 age groups

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Age.binned, show.col.prc = TRUE)
Status Age.binned Total
Group 1 Group 2 Group 3 Group 4 Group 5
Lived 38
92.7 %
31
79.5 %
34
79.1 %
33
75 %
24
72.7 %
160
80 %
Died 3
7.3 %
8
20.5 %
9
20.9 %
11
25 %
9
27.3 %
40
20 %
Total 41
100 %
39
100 %
43
100 %
44
100 %
33
100 %
200
100 %
χ2=5.930 · df=4 · Cramer’s V=0.172 · p=0.204
> Died  <- c( 3, 8, 9, 11, 9 )
> Total <- c( 41, 39, 43, 44, 33 )
> prop.test(Died, Total)

    5-sample test for equality of proportions without continuity
    correction

data:  Died out of Total
X-squared = 5.93, df = 4, p-value = 0.2044
alternative hypothesis: two.sided
sample estimates:
    prop 1     prop 2     prop 3     prop 4     prop 5 
0.07317073 0.20512821 0.20930233 0.25000000 0.27272727 

Since the p-value of 0.2044 is greater than 0.05, we fail to reject the null hypothesis that there is no difference between the proportion of patients that died across the 5 age groups. We can see from the sample estimates that the proportions are indeed quite similar between the 5 age groups. However, the number of patients that died in age group 1 (the youngest age group) is much lower than the rest. The greatest proportion of patients that died is in group 5 (the oldest age group). The proportions in groups 2 and 3 are quite similar and slightly higher in group 4. However, overall, most of the groups had similar proportions of patients that died, aside from the youngest age group.

Further, when conducting a chi-square test with a 5% level of risk of a type 1 error and 4 degrees of freedom, the critical value is 9.49. Since the chi-squared value of 5.93 is less than this critical value, we, again, fail to reject the null hypothesis that there is no difference between the proportion of patients that died across the 5 age groups.

13.6 Vital status by Systolic Blood Pressure

H0: there is no difference between the proportion of patients that died across the 6 systolic blood pressure level groups, risk = 0.05 Ha: the proportion of patients that died is different in at least one of the 6 systolic blood pressure level groups

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Systolic.grouped, show.col.prc = TRUE)
Status Systolic.grouped Total
elevated hypotension normal stage 1 hypertension stage 2 hypertension stage 3 hypertension
Lived 14
82.4 %
3
25 %
52
85.2 %
24
80 %
54
83.1 %
13
86.7 %
160
80 %
Died 3
17.6 %
9
75 %
9
14.8 %
6
20 %
11
16.9 %
2
13.3 %
40
20 %
Total 17
100 %
12
100 %
61
100 %
30
100 %
65
100 %
15
100 %
200
100 %
χ2=24.597 · df=5 · Cramer’s V=0.351 · Fisher’s p=0.002
> Died  <- c( 3, 9, 9, 6, 11, 2 )
> Total <- c( 17, 12, 61, 30, 65, 15 )
> prop.test(Died, Total)

    6-sample test for equality of proportions without continuity
    correction

data:  Died out of Total
X-squared = 24.597, df = 5, p-value = 0.0001667
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3    prop 4    prop 5    prop 6 
0.1764706 0.7500000 0.1475410 0.2000000 0.1692308 0.1333333 

Since the p-value of 0.0001667 is less than 0.05, we will reject the null hypothesis that there is no difference between the proportion of patients that died in the 6 systolic blood pressure level groups. We can see from the sample estimates that the proportions are indeed different between the 6 groups, with the highest proportion in the hypotenstion range and in the stage 1 hypertension range. The proportions in the elevated, normal, stage 2 hypertension and stage 3 hypertension groups are much lower and are quite close to each other. Thus, the hypotension and stage 1 hypertension groups had higher proportions of patients that died.

Further, when conducting a chi-square test with a 5% level of risk of a type 1 error and 5 degrees of freedom, the critical value is 11.07. Since the chi-squared value of 24.597 is greater than this critical value, we will, again, reject the null hypothesis that there is no difference between the proportion of patients that died in the 6 systolic blood pressure level groups.

13.7 Vital status by Heart Rate

H0: there is no difference between the proportion of patients that died across the bradychardia, normal heart rate and elevated heart rate groups, risk = 0.05 Ha: the proportion of patients that died is different in at least one of the 3 heart rate categories

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$HeartRate.grouped, show.col.prc = TRUE)
Status HeartRate.grouped Total
bradycardia elevated normal
Lived 12
80 %
64
80 %
84
80 %
160
80 %
Died 3
20 %
16
20 %
21
20 %
40
20 %
Total 15
100 %
80
100 %
105
100 %
200
100 %
χ2=0.000 · df=2 · Cramer’s V=0.000 · Fisher’s p=1.000
> Died  <- c( 3, 16, 21 )
> Total <- c( 15, 80, 105 )
> prop.test(Died, Total)

    3-sample test for equality of proportions without continuity
    correction

data:  Died out of Total
X-squared = 0, df = 2, p-value = 1
alternative hypothesis: two.sided
sample estimates:
prop 1 prop 2 prop 3 
   0.2    0.2    0.2 

Since the p-value of 1 is greater than 0.05, we fail to reject the null hypothesis that there is no difference between proportion of patients that died across the bradychardia, normal heart rate and elevated heart rate groups. An exact p-value of 1 means that the difference in proportions of patients that died between the 3 heart rate groups is exactly 0. Thus, there were an equal number of patients that died with bradycardia, normal heart rate and elevated heart rate.

Further, when conducting a chi-square test with a 5% level of risk of a type 1 error and 2 degrees of freedom, the critical value is 5.99. Since the chi-squared value of 0 is lower than this critical value, we, again, fail to reject the null hypothesis that there is no difference between the proportion of patients that died having bradychardia, normal heart rate and elevated heart rate.

13.8 Vital status by Race

H0: there is no difference between the proportion of patients that died that were White, Black or otherwise, risk = 0.05 Ha: the proportion of patients that died is different in at least one of the race categories

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Race, show.col.prc = TRUE)
Status Race Total
White Black Other
Lived 138
78.9 %
14
93.3 %
8
80 %
160
80 %
Died 37
21.1 %
1
6.7 %
2
20 %
40
20 %
Total 175
100 %
15
100 %
10
100 %
200
100 %
χ2=1.810 · df=2 · Cramer’s V=0.095 · Fisher’s p=0.497
> Died  <- c( 37, 1, 2 )
> Total <- c( 175, 15, 10 )
> prop.test(Died, Total)

    3-sample test for equality of proportions without continuity
    correction

data:  Died out of Total
X-squared = 1.8095, df = 2, p-value = 0.4046
alternative hypothesis: two.sided
sample estimates:
    prop 1     prop 2     prop 3 
0.21142857 0.06666667 0.20000000 

Since the p-value of 0.4046 is greater than 0.05, we fail to reject the null hypothesis that there is no difference between the proportion of patients that died that were White, Black or otherwise.

Further, when conducting a chi-square test with a 5% level of risk of a type 1 error and 2 degrees of freedom, the critical value is 5.99. Since the chi-squared value of 1.8095 is less than this critical value, we, again, fail to reject the null hypothesis that there is no difference between the proportion of patients that died that were White, Black or otherwise.

When looking at the sample estimates and at the table, we see that there was almost no difference between the proportion of White patients that died and the proportion of ‘Other’ patients that died. However, the proportion of Black patients that died was slightly lower than the White and ‘Other’ groups.

13.9 Vital status by Consciousness

H0: there is no difference between the proportion of patients that died across the Conscious, Deep Stupor and Coma groups, risk = 0.05 Ha: the proportion of patients that died is different in at least one of the Conscious, Deep Stupor and Coma groups

> pacman::p_load(sjPlot)
> sjt.xtab(ICU$Status, ICU$Consciousness, show.col.prc = TRUE)
Status Consciousness Total
Conscious Deep Stupor Coma
Lived 158
85.4 %
0
0 %
2
20 %
160
80 %
Died 27
14.6 %
5
100 %
8
80 %
40
20 %
Total 185
100 %
5
100 %
10
100 %
200
100 %
χ2=45.878 · df=2 · Cramer’s V=0.479 · Fisher’s p=0.000
> Died  <- c( 27, 5, 8 )
> Total <- c( 185, 5, 10 )
> prop.test(Died, Total)

    3-sample test for equality of proportions without continuity
    correction

data:  Died out of Total
X-squared = 45.878, df = 2, p-value = 1.091e-10
alternative hypothesis: two.sided
sample estimates:
   prop 1    prop 2    prop 3 
0.1459459 1.0000000 0.8000000 

Since the p-value of 1.091e-10 is less than 0.05, we will reject the null hypothesis that there is no difference between the proportion of patients that died across the Conscious, Deep Stupor and Coma groups.

Further, when conducting a chi-square test with a 5% level of risk of a type 1 error and 2 degrees of freedom, the critical value is 5.99. Since the chi-squared value of 45.878 is greater than this critical value, we will, again, reject the null hypothesis that there is no difference between the proportion of patients that died across the 3 Consciousness groups.

With a proportion of 1, the Deep Stupor group had the highest proportion of patients that died as all the patients with a deep stupor died. The Coma group has the second highest proportion of patients that died with a sample estimate of 80% dying. The lowest number of deaths was in the group of patients that had no coma or stupor, which is expected. Those with a deep stupor had a higher proportion of deaths than those in a coma.

From the tests of equal proportions, we note that the the following variables had a statistically significant impact on the Vital status of the patient: Service (type of procedure), Type (of admission) and Conciousness.

The following variables were not significant predictors of Vital status of the patient: Cancer, Previous (ICU admission within the last 6 months), Age, Systolic Blood Pressure, Heart Rate and Race.

14 Detection of outliers of Int Variables

> par(mfrow=c(1,3))
> boxplot(ICU$Age, main="Boxplot of Age")
> boxplot(ICU$Systolic, main="Boxplot of Systolic Blood Pressure")
> boxplot(ICU$HeartRate, main="Boxplot of Heart Rate")

As we can see above, the outliers are present in systolic blood pressure and heart rate. To see the values of the outliers, please see below, where the first row includes the outliers for systolic blood pressure, and the second includes outliers for heart rate.

> systolic_outlier<-boxplot.stats(ICU$Systolic)
> heartrate_outlier<-boxplot.stats(ICU$HeartRate)
> systolic_outlier$out
[1]  48 212 224  36 256
> heartrate_outlier$out
[1] 192

15 Crosstabs (relations between categorical variables)

> sjt.xtab(ICU$Status, ICU$Sex, show.col.prc = TRUE)
Status Sex Total
Male Female
Lived 100
80.6 %
60
78.9 %
160
80 %
Died 24
19.4 %
16
21.1 %
40
20 %
Total 124
100 %
76
100 %
200
100 %
χ2=0.012 · df=1 · φ=0.021 · p=0.913
> sjt.xtab(ICU$Status, ICU$Race, show.col.prc = TRUE)
Status Race Total
White Black Other
Lived 138
78.9 %
14
93.3 %
8
80 %
160
80 %
Died 37
21.1 %
1
6.7 %
2
20 %
40
20 %
Total 175
100 %
15
100 %
10
100 %
200
100 %
χ2=1.810 · df=2 · Cramer’s V=0.095 · Fisher’s p=0.517
> sjt.xtab(ICU$Status, ICU$Service, show.col.prc = TRUE)
Status Service Total
Medical Surgical
Lived 67
72 %
93
86.9 %
160
80 %
Died 26
28 %
14
13.1 %
40
20 %
Total 93
100 %
107
100 %
200
100 %
χ2=5.981 · df=1 · φ=0.185 · p=0.014
> sjt.xtab(ICU$Status, ICU$Cancer, show.col.prc = TRUE)
Status Cancer Total
No Yes
Lived 144
80 %
16
80 %
160
80 %
Died 36
20 %
4
20 %
40
20 %
Total 180
100 %
20
100 %
200
100 %
χ2=0.000 · df=1 · φ=0.000 · Fisher’s p=1.000
> sjt.xtab(ICU$Status, ICU$Renal, show.col.prc = TRUE)
Status Renal Total
No Yes
Lived 149
82.3 %
11
57.9 %
160
80 %
Died 32
17.7 %
8
42.1 %
40
20 %
Total 181
100 %
19
100 %
200
100 %
χ2=4.976 · df=1 · φ=0.179 · Fisher’s p=0.029
> sjt.xtab(ICU$Status, ICU$Infection, show.col.prc = TRUE)
Status Infection Total
No Yes
Lived 100
86.2 %
60
71.4 %
160
80 %
Died 16
13.8 %
24
28.6 %
40
20 %
Total 116
100 %
84
100 %
200
100 %
χ2=5.759 · df=1 · φ=0.182 · p=0.016
> sjt.xtab(ICU$Status, ICU$CPR, show.col.prc = TRUE)
Status CPR Total
No Yes
Lived 154
82.4 %
6
46.2 %
160
80 %
Died 33
17.6 %
7
53.8 %
40
20 %
Total 187
100 %
13
100 %
200
100 %
χ2=7.821 · df=1 · φ=0.223 · Fisher’s p=0.005
> sjt.xtab(ICU$Status, ICU$Previous, show.col.prc = TRUE)
Status Previous Total
No Yes
Lived 137
80.6 %
23
76.7 %
160
80 %
Died 33
19.4 %
7
23.3 %
40
20 %
Total 170
100 %
30
100 %
200
100 %
χ2=0.061 · df=1 · φ=0.035 · Fisher’s p=0.624
> sjt.xtab(ICU$Status, ICU$Type, show.col.prc = TRUE)
Status Type Total
Elective Emergency
Lived 51
96.2 %
109
74.1 %
160
80 %
Died 2
3.8 %
38
25.9 %
40
20 %
Total 53
100 %
147
100 %
200
100 %
χ2=10.527 · df=1 · φ=0.244 · p=0.001
> sjt.xtab(ICU$Status, ICU$Fracture, show.col.prc = TRUE)
Status Fracture Total
No Yes
Lived 148
80 %
12
80 %
160
80 %
Died 37
20 %
3
20 %
40
20 %
Total 185
100 %
15
100 %
200
100 %
χ2=0.000 · df=1 · φ=0.000 · Fisher’s p=1.000
> sjt.xtab(ICU$Status, ICU$PO2, show.col.prc = TRUE)
Status PO2 Total
No Yes
Lived 149
81 %
11
68.8 %
160
80 %
Died 35
19 %
5
31.2 %
40
20 %
Total 184
100 %
16
100 %
200
100 %
χ2=0.718 · df=1 · φ=0.083 · Fisher’s p=0.324
> sjt.xtab(ICU$Status, ICU$PH, show.col.prc = TRUE)
Status PH Total
No Yes
Lived 151
80.7 %
9
69.2 %
160
80 %
Died 36
19.3 %
4
30.8 %
40
20 %
Total 187
100 %
13
100 %
200
100 %
χ2=0.416 · df=1 · φ=0.071 · Fisher’s p=0.297
> sjt.xtab(ICU$Status, ICU$PCO2, show.col.prc = TRUE)
Status PCO2 Total
No Yes
Lived 144
80 %
16
80 %
160
80 %
Died 36
20 %
4
20 %
40
20 %
Total 180
100 %
20
100 %
200
100 %
χ2=0.000 · df=1 · φ=0.000 · Fisher’s p=1.000
> sjt.xtab(ICU$Status, ICU$Bicarbonate, show.col.prc = TRUE)
Status Bicarbonate Total
No Yes
Lived 150
81.1 %
10
66.7 %
160
80 %
Died 35
18.9 %
5
33.3 %
40
20 %
Total 185
100 %
15
100 %
200
100 %
χ2=1.014 · df=1 · φ=0.095 · Fisher’s p=0.187
> sjt.xtab(ICU$Status, ICU$Creatinine, show.col.prc = TRUE)
Status Creatinine Total
No Yes
Lived 155
81.6 %
5
50 %
160
80 %
Died 35
18.4 %
5
50 %
40
20 %
Total 190
100 %
10
100 %
200
100 %
χ2=4.112 · df=1 · φ=0.172 · Fisher’s p=0.029
> sjt.xtab(ICU$Status, ICU$Consciousness, show.col.prc = TRUE)
Status Consciousness Total
Conscious Deep Stupor Coma
Lived 158
85.4 %
0
0 %
2
20 %
160
80 %
Died 27
14.6 %
5
100 %
8
80 %
40
20 %
Total 185
100 %
5
100 %
10
100 %
200
100 %
χ2=45.878 · df=2 · Cramer’s V=0.479 · Fisher’s p=0.000

Statistically significant crosstabs include: Creatinine, Type, CPR, Infection,Renal, Service.

16 Logistic Regression

Prior to conducting a logistic regression, we would like to break down the systolic blood pressure into levels. We will use clinically defined levels for hypo, normal, elevated and hyper. Ranges below 90 are considered hypo, whereas ranges from 90 to 120 are considered normal, ranges between 120 and 129 are considered elevated, and 130 plus are considered hyper.

> ICU$Systolic<-cut(ICU$Systolic, breaks=c(0,89,119,129,256))
> levels(ICU$Systolic)<-c("hypo", "normal", "elevated", "hyper")

Below is the logistic regression model with all the predictors in the dataset. As we can see, key predictors that are shown to be statistically significant are age, cancer, systolic pressure, fracture, arterial blood gas concentration, type, and consciousness. We have some confidence in our model due to the small difference between the null and residual deviance. Some of our confidence intervals have zero in them, leading us to have some doubt in our model.

> ICU.m<-glm(formula=Status~Age+Sex+Race+Service+Cancer+Renal+Infection+CPR+Systolic+HeartRate+Previous+Type+Fracture+PO2+PH+PCO2+Bicarbonate+Creatinine+Consciousness, family=binomial(logit), data=ICU)
> summary(ICU.m)

Call:
glm(formula = Status ~ Age + Sex + Race + Service + Cancer + 
    Renal + Infection + CPR + Systolic + HeartRate + Previous + 
    Type + Fracture + PO2 + PH + PCO2 + Bicarbonate + Creatinine + 
    Consciousness, family = binomial(logit), data = ICU)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.45809  -0.52365  -0.17347  -0.00019   2.92298  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)   
(Intercept)              -5.913e+00  2.250e+00  -2.627  0.00861 **
Age                       5.124e-02  1.829e-02   2.802  0.00508 **
SexFemale                -6.228e-01  5.563e-01  -1.119  0.26295   
RaceBlack                -1.628e+01  1.320e+03  -0.012  0.99015   
RaceOther                 4.335e-01  1.312e+00   0.330  0.74115   
ServiceSurgical          -5.792e-01  6.411e-01  -0.904  0.36625   
CancerYes                 3.157e+00  1.141e+00   2.766  0.00567 **
RenalYes                  1.605e-01  8.434e-01   0.190  0.84906   
InfectionYes             -4.125e-02  5.657e-01  -0.073  0.94186   
CPRYes                    1.347e+00  9.901e-01   1.360  0.17372   
Systolicnormal           -2.175e+00  1.044e+00  -2.084  0.03713 * 
Systolicelevated         -1.612e+00  1.114e+00  -1.447  0.14782   
Systolichyper            -2.061e+00  1.026e+00  -2.008  0.04465 * 
HeartRate                -9.748e-04  1.029e-02  -0.095  0.92455   
PreviousYes               1.013e+00  6.937e-01   1.460  0.14439   
TypeEmergency             3.396e+00  1.330e+00   2.553  0.01068 * 
FractureYes               1.566e+00  1.101e+00   1.422  0.15506   
PO2Yes                   -7.044e-01  9.923e-01  -0.710  0.47781   
PHYes                     1.526e+00  1.229e+00   1.241  0.21463   
PCO2Yes                  -2.447e+00  1.235e+00  -1.981  0.04756 * 
BicarbonateYes           -8.800e-03  9.056e-01  -0.010  0.99225   
CreatinineYes            -2.035e-01  1.139e+00  -0.179  0.85827   
ConsciousnessDeep Stupor  3.538e+01  2.568e+03   0.014  0.98901   
ConsciousnessComa         3.373e+00  1.359e+00   2.482  0.01307 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 112.67  on 176  degrees of freedom
AIC: 160.67

Number of Fisher Scoring iterations: 17
> confint(ICU.m)
                                 2.5 %      97.5 %
(Intercept)               -10.71446902 -1.69601255
Age                         0.01835185  0.09076391
SexFemale                  -1.76222223  0.44150426
RaceBlack                -406.17117888 48.28576524
RaceOther                  -2.78151725  2.76940774
ServiceSurgical            -1.87453755  0.66822713
CancerYes                   1.04815774  5.66235244
RenalYes                   -1.58805858  1.78878733
InfectionYes               -1.17641537  1.06336175
CPRYes                     -0.63018852  3.34518326
Systolicnormal             -4.37115591 -0.21283200
Systolicelevated           -3.96121708  0.47666682
Systolichyper              -4.25385307 -0.16344121
HeartRate                  -0.02182989  0.01893995
PreviousYes                -0.38940841  2.37091562
TypeEmergency               1.20038241  6.70600040
FractureYes                -0.77450551  3.72610086
PO2Yes                     -2.80364875  1.17162344
PHYes                      -0.83446254  4.05350727
PCO2Yes                    -5.12715251 -0.23880882
BicarbonateYes             -1.88096203  1.74562784
CreatinineYes              -2.55264347  1.98603281
ConsciousnessDeep Stupor -229.97748940          NA
ConsciousnessComa           0.93736975  6.35058207
> exp(coef(ICU.m))
             (Intercept)                      Age                SexFemale 
            2.704937e-03             1.052574e+00             5.364495e-01 
               RaceBlack                RaceOther          ServiceSurgical 
            8.463168e-08             1.542613e+00             5.603338e-01 
               CancerYes                 RenalYes             InfectionYes 
            2.350715e+01             1.174101e+00             9.595848e-01 
                  CPRYes           Systolicnormal         Systolicelevated 
            3.845560e+00             1.135578e-01             1.995610e-01 
           Systolichyper                HeartRate              PreviousYes 
            1.273357e-01             9.990257e-01             2.752703e+00 
           TypeEmergency              FractureYes                   PO2Yes 
            2.983051e+01             4.788341e+00             4.944214e-01 
                   PHYes                  PCO2Yes           BicarbonateYes 
            4.598443e+00             8.656750e-02             9.912381e-01 
           CreatinineYes ConsciousnessDeep Stupor        ConsciousnessComa 
            8.159074e-01             2.322342e+15             2.915368e+01 

16.1 Final model

> ICU.final.m<-glm(formula=Status~Age+Cancer+Systolic+PCO2+Consciousness+Type, family=binomial(logit),data=ICU)
> summary(ICU.final.m)

Call:
glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = ICU)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9242  -0.5616  -0.3078  -0.1039   2.5967  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)   
(Intercept)                -5.46992    1.78152  -3.070  0.00214 **
Age                         0.03715    0.01318   2.819  0.00481 **
CancerYes                   2.55498    1.06544   2.398  0.01648 * 
Systolicnormal             -2.17702    0.87730  -2.481  0.01308 * 
Systolicelevated           -1.71982    0.94888  -1.812  0.06991 . 
Systolichyper              -2.02455    0.80102  -2.527  0.01149 * 
PCO2Yes                    -1.46173    0.86529  -1.689  0.09116 . 
ConsciousnessDeep Stupor   20.75726 1561.13214   0.013  0.98939   
ConsciousnessComa           2.76952    0.97924   2.828  0.00468 **
TypeEmergency               3.65070    1.28917   2.832  0.00463 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 200.16  on 199  degrees of freedom
Residual deviance: 125.60  on 190  degrees of freedom
AIC: 145.6

Number of Fisher Scoring iterations: 16
> confint(ICU.final.m)
                                 2.5 %      97.5 %
(Intercept)                -9.46575467 -2.26757094
Age                         0.01291117  0.06506014
CancerYes                   0.59112234  4.95935713
Systolicnormal             -4.01598420 -0.51648433
Systolicelevated           -3.71304718  0.07016753
Systolichyper              -3.71337117 -0.50242069
PCO2Yes                    -3.36824530  0.08432436
ConsciousnessDeep Stupor -140.94294699          NA
ConsciousnessComa           0.99821470  4.98622776
TypeEmergency               1.60812369  6.92583339
> exp(coef(ICU.final.m))
             (Intercept)                      Age                CancerYes 
            4.211568e-03             1.037849e+00             1.287104e+01 
          Systolicnormal         Systolicelevated            Systolichyper 
            1.133791e-01             1.790977e-01             1.320539e-01 
                 PCO2Yes ConsciousnessDeep Stupor        ConsciousnessComa 
            2.318346e-01             1.034580e+09             1.595094e+01 
           TypeEmergency 
            3.850154e+01 
> pacman::p_load(rsample)
> set.seed(78)
> train_test_split <- initial_split(ICU)
> train <- training(train_test_split)
> test <- testing(train_test_split)
> (samp <- dim(train_test_split))
  analysis assessment          n          p 
       150         50        200         24 
> ICU.tr<-glm(Status~Age+Cancer+Systolic+PCO2+Consciousness+Type, family=binomial(logit),data=train)
> summary(ICU.tr)

Call:
glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9703  -0.5116  -0.2826  -0.1133   2.5197  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)   
(Intercept)                -5.04848    1.93964  -2.603  0.00925 **
Age                         0.03639    0.01569   2.319  0.02042 * 
CancerYes                   2.56538    1.09773   2.337  0.01944 * 
Systolicnormal             -2.44624    1.06730  -2.292  0.02191 * 
Systolicelevated           -2.27752    1.23500  -1.844  0.06516 . 
Systolichyper              -1.99802    0.96884  -2.062  0.03918 * 
PCO2Yes                    -1.38288    0.95161  -1.453  0.14617   
ConsciousnessDeep Stupor   19.82640 1234.36430   0.016  0.98718   
ConsciousnessComa           2.70147    1.01560   2.660  0.00781 **
TypeEmergency               3.36914    1.28536   2.621  0.00876 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 144.406  on 149  degrees of freedom
Residual deviance:  89.393  on 140  degrees of freedom
AIC: 109.39

Number of Fisher Scoring iterations: 15

16.2 Testing & training the model

> predicted.val<-predict(ICU.tr, newdata=test)
> head(predicted.val)
        1         4        13        16        19        20 
-2.694963 -1.712560 -4.645074 -5.154468 -1.385092 -4.706261 
> predicted.prob <- predict(ICU.tr, test, type = "response")
> head( predict( ICU.tr, test, type="response") )
          1           4          13          16          19          20 
0.063271249 0.152831961 0.009517366 0.005740407 0.200192370 0.008957547 
> predicted.classes <- ifelse( predicted.prob > 0.5, "Lived", "Dead" )
> head(predicted.classes)
     1      4     13     16     19     20 
"Dead" "Dead" "Dead" "Dead" "Dead" "Dead" 
> pacman::p_load(car)
> car::vif(ICU.tr)
                  GVIF Df GVIF^(1/(2*Df))
Age           1.088868  1        1.043488
Cancer        1.541005  1        1.241372
Systolic      1.351654  3        1.051504
PCO2          1.430087  1        1.195862
Consciousness 1.162778  2        1.038423
Type          1.439780  1        1.199909
> fit0<-glm(Status~1, family=binomial(logit), data=train)
> summary(fit0)

Call:
glm(formula = Status ~ 1, family = binomial(logit), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6428  -0.6428  -0.6428  -0.6428   1.8322  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4718     0.2095  -7.024 2.16e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 144.41  on 149  degrees of freedom
Residual deviance: 144.41  on 149  degrees of freedom
AIC: 146.41

Number of Fisher Scoring iterations: 4
> fitall<-glm(Status~Age+Cancer+Systolic+PCO2+Consciousness+Type, family=binomial(logit), data=train)
> summary(fitall)

Call:
glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9703  -0.5116  -0.2826  -0.1133   2.5197  

Coefficients:
                           Estimate Std. Error z value Pr(>|z|)   
(Intercept)                -5.04848    1.93964  -2.603  0.00925 **
Age                         0.03639    0.01569   2.319  0.02042 * 
CancerYes                   2.56538    1.09773   2.337  0.01944 * 
Systolicnormal             -2.44624    1.06730  -2.292  0.02191 * 
Systolicelevated           -2.27752    1.23500  -1.844  0.06516 . 
Systolichyper              -1.99802    0.96884  -2.062  0.03918 * 
PCO2Yes                    -1.38288    0.95161  -1.453  0.14617   
ConsciousnessDeep Stupor   19.82640 1234.36430   0.016  0.98718   
ConsciousnessComa           2.70147    1.01560   2.660  0.00781 **
TypeEmergency               3.36914    1.28536   2.621  0.00876 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 144.406  on 149  degrees of freedom
Residual deviance:  89.393  on 140  degrees of freedom
AIC: 109.39

Number of Fisher Scoring iterations: 15
> anova(fitall, fit0, test="Chisq")
Analysis of Deviance Table

Model 1: Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + Type
Model 2: Status ~ 1
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       140     89.393                          
2       149    144.406 -9  -55.013 1.211e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> pacman::p_load(survey)
> regTermTest(ICU.tr, "Age")
Wald test for Age
 in glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)
F =  5.375635  on  1  and  140  df: p= 0.021868 
> regTermTest(ICU.tr, "Cancer")
Wald test for Cancer
 in glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)
F =  5.461517  on  1  and  140  df: p= 0.020858 
> regTermTest(ICU.tr, "Systolic")
Wald test for Systolic
 in glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)
F =  1.890836  on  3  and  140  df: p= 0.13391 
> regTermTest(ICU.tr, "Type")
Wald test for Type
 in glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)
F =  6.870492  on  1  and  140  df: p= 0.0097314 
> regTermTest(ICU.tr, "PCO2")
Wald test for PCO2
 in glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)
F =  2.111782  on  1  and  140  df: p= 0.14841 
> regTermTest(ICU.tr, "Consciousness")
Wald test for Consciousness
 in glm(formula = Status ~ Age + Cancer + Systolic + PCO2 + Consciousness + 
    Type, family = binomial(logit), data = train)
F =  3.537891  on  2  and  140  df: p= 0.031702 

17 Classification and Regression Trees

> pacman::p_load(rpart, rpart.plot, rattle, dplyr)
> set.seed(2715)
> 
> #final model 
> icutree=rpart(Status~Age+Cancer+Systolic+Type+PCO2+Consciousness+Type+Fracture+Sex+Race+Service+Renal+Infection+CPR+HeartRate+Previous+PO2+PH+Bicarbonate+Creatinine+Consciousness, method="class", data=ICU)
> #rpart.plot(icutree,
>            #extra = 104, # show fitted class, probs, percentages
>            #box.palette = "GnYlRd", # color scheme
>            #prefix = "Status\n",
>            #branch.lty = 1, # solid branch lines, 3 = dotted
>            #shadow.col = "gray", # shadows under the node boxes
>            #split.prefix = "is ", # put "is " before split text
>            #split.suffix = "?", # put "?" after split text
>            #nn = TRUE, # display the node numbers
>            #tweak = 1.2)
> fancyRpartPlot(icutree)

> printcp(icutree)

Classification tree:
rpart(formula = Status ~ Age + Cancer + Systolic + Type + PCO2 + 
    Consciousness + Type + Fracture + Sex + Race + Service + 
    Renal + Infection + CPR + HeartRate + Previous + PO2 + PH + 
    Bicarbonate + Creatinine + Consciousness, data = ICU, method = "class")

Variables actually used in tree construction:
[1] Consciousness Systolic     

Root node error: 40/200 = 0.2

n= 200 

     CP nsplit rel error xerror    xstd
1 0.275      0     1.000  1.000 0.14142
2 0.075      1     0.725  0.725 0.12449
3 0.010      2     0.650  0.725 0.12449
> predicted.classes <- icutree %>% predict(, data=icu, type = "class")
> head(predicted.classes, 12)
    1     2     3     4     5     6     7     8     9    10    11    12 
Lived Lived Lived Lived Lived Lived Lived Lived Lived Lived Lived Lived 
Levels: Lived Died
> mean(predicted.classes == ICU$Status)
[1] 0.87
> plotcp(icutree)

 

A work by Ekene Olatunji, Caterine Nassralla, Victoria Chin, Bassam Chamas, and Sajiya Somji

Msc. eHealth

DeGroote School of Business

eHealth 705 Statistics for eHealth - Final Project